This tutorial will focus on analysing the updated data of the worldwide Novel Corona virus (COVID-19) pandemic.
There are several data sources available online. We will use the data collected from a range of sources by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) and hosted on their GitHub repository. We will also use daily COVID-19 data from the European Centre for Disease Prevention and Control website Note that both data sources are lagging 1-2 days behind our current date!!
Start RStudio and create a new project named Workshop3 in a new folder (if you need a reminder ho to do it, check out Workshop1 Tutorial on BB).
Once RStudio restarts inside the project’s folder, create a new R script named Workshop3.R and 2 new folders, one named data for our input data and another named output for our plots.
For this analysis we will again use some packages from the Tidyverse, but this time we load the specific packages (which are supposed to be pre-installed on your computers) to try and avoid having to download the entire tidyverse. In addition to the Tidyverse packages we’ve got to know in the previous workshop we will use the plotly package to create interactive plots, paletteer for custom color palettes, readxl to read MS-Excel file, scales to format large numbers, lubridate to better handle dates, glue to paste together strings, highcharter to plot the data on a world map and a few others.
To install these packages, we will introduce a package called pacman that will assist in loading the required packages and installing them if they’re not already installed. To install it we use the install.packages('pacman') command, please note that the package name need to be quoted and that we only need to perform it once, or when we want or need to update the package. Once the package was installed, we can load its functions using the library(pacman) command and then load/install all the other packages at once with p_load() function.
# install required packages - needed only once! (comment with a # after first use)
install.packages('pacman')
# load required packages
library(pacman)
required_packages <- c("dplyr", "ggplot2", "paletteer", "stringr", "readr","readxl", "glue",
"highcharter", "forcats", "scales", "plotly", "lubridate", "here")
p_load(char=required_packages)More information on installing and using R packages can be found in this tutorial.
Now that we’ve got RStudio up and running and our packages installed and loaded, we can read data into R from our local computer or from web locations using dedicated functions specific to the file type (.csv, .txt, .xlsx, etc.). Please download the data files from Blackboard and put them in the data folder.
We will use the read_csv() command/function from the readr package (part of the tidyverse) to load the data from a file on our computer into a variable of type data frame (table). If we don’t want to use external packages, we can use the read.csv() function from base R, which will slightly change the structure of the resulting data frame (will convert all text columns into factors and won’t automatically parse columns containing dates).
> Note that the path to the files can be either relative or absolute to our current working directory and that we use a forward slash / (Unix/Mac style) and not a backslash \ (Windows style), to make life easier, make use of RStudio’s auto-completion feature and the here package._
Let’s use built-in functions for a brief data exploration, such as head() to show the first 10 rows of the data and str() for the type of data in each column:
## # A tibble: 6 x 5
## Country_Region Date Confirmed Deaths Recovered
## <chr> <date> <dbl> <dbl> <dbl>
## 1 Afghanistan 2020-01-22 0 0 0
## 2 Afghanistan 2020-01-23 0 0 0
## 3 Afghanistan 2020-01-24 0 0 0
## 4 Afghanistan 2020-01-25 0 0 0
## 5 Afghanistan 2020-01-26 0 0 0
## 6 Afghanistan 2020-01-27 0 0 0
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 39292 obs. of 5 variables:
## $ Country_Region: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Date : Date, format: "2020-01-22" "2020-01-23" ...
## $ Confirmed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Deaths : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Recovered : num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. `Country/Region` = col_character(),
## .. Date = col_date(format = ""),
## .. Confirmed = col_double(),
## .. Deaths = col_double(),
## .. Recovered = col_double()
## .. )
We can also produce some descriptive statistics to better understand the data and the nature of each variable. The summary() function (as can be guessed by its name) provides a quick summary of basic descriptive statistics, such as the mean, min, max and quantiles for continuous numerical values.
## Country_Region Date Confirmed Deaths
## Length:39292 Min. :2020-01-22 Min. : 0 Min. : 0
## Class :character 1st Qu.:2020-03-14 1st Qu.: 6 1st Qu.: 0
## Mode :character Median :2020-05-05 Median : 427 Median : 7
## Mean :2020-05-05 Mean : 31433 Mean : 1485
## 3rd Qu.:2020-06-26 3rd Qu.: 5488 3rd Qu.: 110
## Max. :2020-08-17 Max. :5438325 Max. :170497
## Recovered
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 81.5
## Mean : 16120.7
## 3rd Qu.: 1821.0
## Max. :2699080.0
We can see that most of our data contains ‘0’ (check the median of Confirmed column). Just to confirm that, let’s plot a histogram of all the confirmed cases
What are the metadata columns that describe our observations?
Country
Date
The data is evolving over a time-series, to there’s no point treating it as a random population sample.
Let’s look at confirmed cases data for the 10 most affected countries (to date). To find out these countries so we need to wrangle our data a little bit using the following steps:
group_by()arrange(desc(Date))slice(1) and remove grouping with ungroup()$ signc()filter()Optional step:
fct_reorder()Then we can look at the data as a table and make a plot with the number of cases in the y-axis and date in the x-axis.
# find the 10 most affected countries (to date)
latest_data <- covid_data %>% group_by(Country_Region) %>% arrange(desc(Date)) %>% slice(1) %>% ungroup()
most_affected_countries <- latest_data %>% arrange(desc(Confirmed)) %>% slice(1:10) %>% .$Country_Region
# have a look at the data as a table
latest_data %>% arrange(desc(Confirmed)) %>% filter(Country_Region %in% c(most_affected_countries, "Australia")) ## # A tibble: 11 x 5
## Country_Region Date Confirmed Deaths Recovered
## <chr> <date> <dbl> <dbl> <dbl>
## 1 US 2020-08-17 5438325 170497 1865580
## 2 Brazil 2020-08-17 3359570 108536 2699080
## 3 India 2020-08-17 2702681 51797 1977671
## 4 Russia 2020-08-17 925558 15707 734573
## 5 South Africa 2020-08-17 589886 11982 477671
## 6 Peru 2020-08-17 535946 26281 370717
## 7 Mexico 2020-08-17 525733 57023 430840
## 8 Colombia 2020-08-17 476660 15372 301525
## 9 Chile 2020-08-17 387502 10513 360385
## 10 Spain 2020-08-17 359082 28646 150376
## 11 Australia 2020-08-17 23773 438 14536
# subset just the data from the 10 most affected countries and Australia order them from the most affected to the least one
most_affected_data <- covid_data %>%
filter(Country_Region %in% c(most_affected_countries, "Australia")) %>%
mutate(Country=fct_reorder(factor(Country_Region), Confirmed, .desc = TRUE))
# create a line plot the data
ggplot(most_affected_data, aes(x=Date, y=Confirmed, colour=Country)) +
geom_line(size=1) +
scale_color_paletteer_d("ggsci::springfield_simpsons") +
labs(color="Country") +
theme_bw(16)It’s a bit hard to figure out how the pandemic evolved because the numbers in US, Brazil and India are in order of magnitude larger than the rest (which are very close to each other). How can we make it more visible?
# create the plot
plot <- ggplot(most_affected_data, aes(x=Date, y=Confirmed, colour=Country)) +
geom_line(size=0.6) + scale_y_log10(labels=comma) +
scale_color_paletteer_d("ggsci::springfield_simpsons") +
labs(color="Country") +
theme_bw(16)
# show an interactive plot
ggplotly(plot)## Warning: Transformation introduced infinite values in continuous y-axis
Why did we get a warning message? How can we solve it? What can we infer from the graph (exponential increase)?
What happens when we take the log of 0?? Can we remove those 0s with the `filter()` function?
We can see a very similar trend for most countries and while the curve is flattening and passed the exponential growth phase, the numbers are still rising rapidly in India, Colombia and even Spain (though on a smaller scale).
To be able to compare the disease progress between countries, we could “normalise” the data to show the epidemic spread using a similar baseline, for example, counting the days since the 100th confirmed case in each country.
baselined_data <- most_affected_data %>% filter(Confirmed>=100) %>% group_by(Country_Region) %>%
mutate(Days_since_100=difftime(Date,first(Date), units = "days")) %>% ungroup()
# create the plot
plot <- ggplot(baselined_data, aes(x=Days_since_100, y=Confirmed, colour=Country)) +
geom_line(size=0.6) + scale_y_log10(labels=comma) +
scale_color_paletteer_d("ggsci::springfield_simpsons") +
labs(color="Country", x="Days since the 100th patient") +
theme_bw(16)
# show an interactive plot
ggplotly(plot)We can also look at the change in the number of cases relative to the population size of each country. To do so we will read another dataset of daily new cases collected by the European Centre for Disease Prevention and Control, which include the population of each country (as of 2019).
We can read it directly from their website as a .csv file (as below) or we can download the most recent daily COVID-19 data from the website (download the Excel file and place it in the data folder, then use read_excel() instead of read_csv()).
We need to wrangle this one up as well to make sure that the countries names are consistent (try to open the file in excel file and find the issue) and the dates are parsed correctly. We will then calculate the number of accumulated cases and deaths using mutate(Total_cases=cumsum(Cases), Total_Deaths=cumsum(Deaths) as well as the number of cases per million people in each country.
eu_data <- read_csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv", na = "")
glimpse(eu_data)## Observations: 37,029
## Variables: 12
## $ dateRep <chr> "18/08...
## $ day <dbl> 18, 17...
## $ month <dbl> 8, 8, ...
## $ year <dbl> 2020, ...
## $ cases <dbl> 3, 45,...
## $ deaths <dbl> 0, 5, ...
## $ countriesAndTerritories <chr> "Afgha...
## $ geoId <chr> "AF", ...
## $ countryterritoryCode <chr> "AFG",...
## $ popData2019 <dbl> 380417...
## $ continentExp <chr> "Asia"...
## $ `Cumulative_number_for_14_days_of_COVID-19_cases_per_100000` <dbl> 2.2396...
covid_daily_data <- eu_data %>%
# mutate(Date=as.Date(paste(day, month, year, sep = "/"), "%d/%m/%y")) %>%
mutate(Date=dmy(dateRep)) %>%
rename(Country=countriesAndTerritories) %>%
mutate(Country=str_to_title(gsub("_", " ", Country))) %>%
group_by(Country) %>% arrange(Country, Date) %>%
mutate(Total_cases=cumsum(cases), Total_Deaths=cumsum(deaths), Cases_per_mill=Total_cases*1e6/popData2019,
geoId = case_when(geoId=="UK"~"GB",
geoId=="EL"~"GR",
geoId=="PS"~"WE",
geoId=="XK"~"KV",
geoId=="JPG11668"~"UM",
TRUE~geoId)) %>%
ungroup()
# use the line below if using the Excel file downloaded from the web page
# read_excel("data/COVID-19-geographic-disbtribution-worldwide-2020-03-22.xlsx")
# iso3c=countrycode(GeoId, origin = 'iso2c', destination = 'iso3c')) %>%
# ungroup()Similar to what we did earlier, we’ll group the data by Country and sort it by Date to select the most recent data that we’ll use to select the most affected countries.
# find the 10 most affected countries and order them from the most affected to the least one
worst_by_pop_data <- covid_daily_data %>% group_by(Country) %>% arrange(desc(Date)) %>% slice(1) %>%
ungroup() %>% filter(popData2019>100000) %>% arrange(desc(Cases_per_mill)) %>% slice(1:11)
# subset the daily data and add Australia as well
plot_data <- covid_daily_data %>% filter(Country %in% c(worst_by_pop_data$Country, "Australia")) %>%
mutate(Country=fct_reorder(factor(Country), Cases_per_mill, .desc = TRUE)) %>%
filter(Date>as.Date("2020-04-01"))
# create the plot
plot <- ggplot(plot_data, aes(x=Date, y=Cases_per_mill, colour=Country)) +
geom_line(size=0.6) + scale_y_continuous(labels=comma) +
scale_color_paletteer_d("rcartocolor::Bold") + # ggsci::springfield_simpsons
labs(x="Date", y="Cases per million people") +
theme_bw(16)
# show an interactive plot
ggplotly(plot)1. Because the country names do not necessarily match (check how USA appears in both datasets)
2. Mortalities (Case Fatality Rate), cases per population density (population/area)?
3. Suggestions?
We can look at the average cases by continent and how the pandemic spread across the globe and the different responses of the continents.
cont_daily_data <- covid_daily_data %>% group_by(continentExp, Date) %>%
summarise_at(c("Total_cases", "Total_Deaths"), ~mean(., na.rm=TRUE)) %>% filter(Total_cases>0)
# create the plot
cont_plot <- ggplot(cont_daily_data, aes(x=Date, y=Total_cases, colour=continentExp)) +
geom_line(size=0.6) + scale_y_log10(labels=comma) +
scale_color_paletteer_d("rcartocolor::Bold") +
labs(color="Continent", y="Mean confirmed cases") +
theme_bw(16)
# show an interactive plot
ggplotly(cont_plot)The BBC has a website with maps and charts detailing the responses to COVID-19 by continent in this webpage. Using this information we can create a small table with the approximate time that each continent applied severe social distancing restrictions or went into lockdown.
lockdown_data <- tibble(continentExp=unique(cont_daily_data$continentExp),
Date=c(dmy("20/03/2020"),
dmy("15/03/2020"),
dmy("1/03/2020"),
dmy("10/03/2020"),
dmy("20/03/2020"), NA)) %>%
inner_join(cont_daily_data)
# create the plot
cont_plot <- ggplot(cont_daily_data, aes(x=Date, y=Total_cases, colour=continentExp)) +
geom_line(size=0.75) + scale_y_log10(labels=comma) +
scale_color_paletteer_d("rcartocolor::Bold") +
labs(color="Continent", y="Mean confirmed cases") +
geom_segment(data = lockdown_data,
mapping = aes(x=Date, xend=Date, colour=continentExp, yend = 0), lty="longdash",
size=0.65, show.legend = FALSE) +
annotate("text", x = dmy("15/04/2020"), y=1, label="Lockdown") +
theme_bw(16)
# show an interactive plot
cont_plot## Warning: Transformation introduced infinite values in continuous y-axis
We can use the same data to visualise the impact of COVID-19 on a global/regional map to grasp its worldwide effect.
We will create a color scale to represent the number of cases and appropriate popup information tags for each country.
# download map data
world_map <- download_map_data("custom/world-palestine-highres")
mapdata <- get_data_from_map(world_map)
# select only most current data
choroplet_data <- covid_daily_data %>% group_by(Country) %>% arrange(desc(Date)) %>%
slice(1) %>% filter(Total_cases>0) %>% mutate(log_cases=log(Total_cases),
CFR=percent(Total_Deaths/Total_cases, accuracy=.01),
Total_cases=number(Total_cases,big.mark = ","),
Total_Deaths = number(Total_Deaths,big.mark = ","),
Cases_per_mill=number(Cases_per_mill, accuracy = .01,big.mark = ","),
pop=number(popData2019,big.mark = ","))
# # define colors
bins <- 10^(0:7)# c(1, 10, 100, 1000, 10000, 100000, 1000000)
cols <- as.character(paletteer_c("viridis::inferno", length(bins), direction = -1))
stops <- data.frame(q=1:length(bins)/length(bins), c=cols) %>% list_parse2(.)
# plot map
highchart(type = "map", hc_opts = list(caption=glue("<b>Number of confirmed COVID-19 cases by Country</b><br/>An up-to-date summary of daily data obtained from the European Centre for Disease Prevention and Control<br/>Data was last updated on {format(max(choroplet_data$Date), '%d/%m/%Y')}."))) %>%
hc_add_series_map(map = world_map, df = choroplet_data,
value = "log_cases", joinBy = c("iso-a2", "geoId")) %>%
hc_colorAxis(stops = color_stops(length(bins), cols), tickPositions=log(bins), showLastLabel=FALSE, labels=list(formatter=JS("function(){
var n=Math.exp(this.value);
if (n < 1e3) return n.toFixed(0);
if (n >= 1e3 && n < 1e6) return +(n / 1e3).toFixed(1) + 'k';
if (n >= 1e6 && n < 1e9) return +(n / 1e6).toFixed(1) + 'M';
if (n >= 1e9 && n < 1e12) return +(n / 1e9).toFixed(1) + 'B';
if (n >= 1e12) return +(n / 1e12).toFixed(1) + 'T';}"))) %>%
hc_tooltip(useHTML=TRUE,headerFormat='',
pointFormat = '{point.Country} confirmed cases : <span style=" border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: gold !important;" >{point.Total_cases}</span><br/>Deaths: <span style=" color: white !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: orangered !important;" >{point.Total_Deaths} (CFR {point.CFR})</span><br/>Population (2019): <b>{point.pop}</b><br/>Cases per million: <b>{point.Cases_per_mill}</b>') %>%
hc_mapNavigation(enabled = TRUE) %>%
hc_title(text = "Number of confirmed COVID-19 cases by Country",
margin = 40, align = "left",
style = list(color = "#2b908f", useHTML = TRUE)) %>%
hc_subtitle(text = glue('Data origin: <a href="https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide">European Centre for Disease Prevention and Control</a> (updated on {format(max(choroplet_data$Date), "%d/%m/%Y")})'),
align = "left",
style = list(color = "#2b908f", fontWeight = "bold")) %>%
hc_exporting(enabled = TRUE)What else can we plot or investigate?
Please contact me at i.bar@griffith.edu.au for any questions or comments.